Search CORE

22 research outputs found

Speculative Staging for Interpreter Optimization

Author: Brunthaler Stefan
Publication venue
Publication date: 08/10/2013
Field of study

Interpreters have a bad reputation for having lower performance than just-in-time compilers. We present a new way of building high performance interpreters that is particularly effective for executing dynamically typed programming languages. The key idea is to combine speculative staging of optimized interpreter instructions with a novel technique of incrementally and iteratively concerting them at run-time. This paper introduces the concepts behind deriving optimized instructions from existing interpreter instructions---incrementally peeling off layers of complexity. When compiling the interpreter, these optimized derivatives will be compiled along with the original interpreter instructions. Therefore, our technique is portable by construction since it leverages the existing compiler's backend. At run-time we use instruction substitution from the interpreter's original and expensive instructions to optimized instruction derivatives to speed up execution. Our technique unites high performance with the simplicity and portability of interpreters---we report that our optimization makes the CPython interpreter up to more than four times faster, where our interpreter closes the gap between and sometimes even outperforms PyPy's just-in-time compiler.Comment: 16 pages, 4 figures, 3 tables. Uses CPython 3.2.3 and PyPy 1.

arXiv.org e-Print Archive

CiteSeerX

IT-Strukturen für die Lebensmittel-Rückverfolgbarkeit mit RFID im Mittelstand der Transportindustrie

Author: Brunthaler Stefan
Publication venue: Technische Hochschule Wildau
Publication date: 01/01/2008
Field of study

Rückverfolgbarkeit ist eine durch EU-Richtlinien (z. B. 178/ 2002) und nationale Gesetze vorgeschriebene Eigenschaft, die Logistikketten (Supply Chains) für Lebensmittel seit Anfang 2005 haben müssen. Dies bedeutet kurz gefasst, dass jedes Produkt in allen seinen Komponenten lückenlos durch die gesamte Logistikkette rückverfolgbar sein muss. Der typische Anwendungsfall (und die Motivation der Politik für diese Gesetzgebung) ist die Aussonderung verdorbener Chargen eines Produktes, bevor die Bevölkerung in großem Maße betroffen ist. Um diese Anwendung tatsächlich in Echtzeit umsetzen zu können, werden informationsverarbeitende Systeme benötigt, die für alle Teilnehmer an der Logistikkette sowie auch für weitere Interessengruppen wie Verbraucherverbände oder öffentliche Einrichtungen einfach, preisgünstig, sicher und mit jeweils individuellen Berechtigungen zugänglich sind. Ein solches System wird in diesem Beitrag vorgeschlagen.Traceability is a feature for food supply chains which must be implemented and is regulated in EU standards (e. g. 178/200) and national laws. Participants in supply chains must co-operate to fulfill these standards, but small or medium companies often do not have the IT infrastructure to do that on a regular and cost effective basis. This article gives an outline for an open IT system structure (»DOTS«) to help participants in supply chains to communicate with respect to traceability of food products and transport units

Inline Caching Meets Quickening. In

Author: Stefan Brunthaler
Publication venue: 'Springer Fachmedien Wiesbaden GmbH'
Publication date: 01/01/2010
Field of study

Abstract. Inline caches effectively eliminate the overhead implied by dynamic typing. Yet, inline caching is mostly used in code generated by just-in-time compilers. We present efficient implementation techniques for using inline caches without dynamic translation, thus enabling future interpreter implementers to use this important optimization techniquewe report speedups of up to a factor of 1.71-without the additional implementation and maintenance costs incurred by using a just-in-time compiler

CiteSeerX

Opaque control-flow integrity.

Author: Kevin W Hamlen
Michael Franz
Per Larsen † Stefan Brunthaler
Vishwath Mohan
Publication venue
Publication date: 01/01/2015
Field of study

Abstract-A new binary software randomization and ControlFlow Integrity (CFI) enforcement system is presented, which is the first to efficiently resist code-reuse attacks launched by informed adversaries who possess full knowledge of the inmemory code layout of victim programs. The defense mitigates a recent wave of implementation disclosure attacks, by which adversaries can exfiltrate in-memory code details in order to prepare code-reuse attacks (e.g., Return-Oriented Programming (ROP) attacks) that bypass fine-grained randomization defenses. Such implementation-aware attacks defeat traditional fine-grained randomization by undermining its assumption that the randomized locations of abusable code gadgets remain secret. Opaque CFI (O-CFI) overcomes this weakness through a novel combination of fine-grained code-randomization and coarsegrained control-flow integrity checking. It conceals the graph of hijackable control-flow edges even from attackers who can view the complete stack, heap, and binary code of the victim process. For maximal efficiency, the integrity checks are implemented using instructions that will soon be hardware-accelerated on commodity x86-x64 processors. The approach is highly practical since it does not require a modified compiler and can protect legacy binaries without access to source code. Experiments using our fully functional prototype implementation show that O-CFI provides significant probabilistic protection against ROP attacks launched by adversaries with complete code layout knowledge, and exhibits only 4.7% mean performance overhead on current hardware (with further overhead reductions to follow on forthcoming Intel processors). I. MOTIVATION Code-reuse attacks (cf., Permission to freely reproduce all or part of this paper for noncommercial purposes is granted provided that copies bear this notice and the full citation on the first page. Reproduction for commercial purposes is strictly prohibited without the prior written consent of the Internet Society, the first-named author (for reproduction of an entire paper only), and the author's employer if the paper was prepared within the scope of employment. This has motivated copious work on defenses against codereuse threats. Prior defenses can generally be categorized into: CFI [1] and artificial software diversity CFI restricts all of a program's runtime control-flows to a graph of whitelisted control-flow edges. Usually the graph is derived from the semantics of the program source code or a conservative disassembly of its binary code. As a result, CFIprotected programs reject control-flow hijacks that attempt to traverse edges not supported by the original program's semantics. Fine-grained CFI monitors indirect control-flows precisely; for example, function callees must return to their exact callers. Although such precision provides the highest security, it also tends to incur high performance overheads (e.g., 21% for precise caller-callee return-matching [1]). Because this overhead is often too high for industry adoption, researchers have proposed many optimized, coarser-grained variants of CFI. Coarse-grained CFI trades some security for better performance by reducing the precision of the checks. For example, functions must return to valid call sites (but not necessarily to the particular site that invoked the callee). Unfortunately, such relaxations have proved dangerous-a number of recent proof-of-concept exploits have shown how even minor relaxations of the control-flow policy can be exploited to effect attacks Artificial software diversity offers a different but complementary approach that randomizes programs in such a way that attacks succeeding against one program instance have a very low probability of success against other (independently randomized) instances of the same program. Probabilistic defenses rely on memory secrecy-i.e., the effects of randomization must remain hidden from attackers. One of the simplest and most widely adopted forms of artificial diversity is Address Space Layout Randomization (ASLR), which randomizes the base addresses of program segments at loadtime. Unfortunately, merely randomizing the base addresses does not yield sufficient entropy to preserve memory secrecy in many cases; there are numerous successful derandomization attacks against ASLR Recently, a new wave of implementation disclosure attacks Experiments show that O-CFI enjoys performance overheads comparable to standard fine-grained diversity and non-opaque, coarse-grained CFI. Moreover, O-CFI's control-flow checking logic is implemented using Intel x86/x64 memory-protection extensions (MPX) that are expected to be hardware-accelerated in commodity CPUs from 2015 onwards. We therefore expect even better performance for O-CFI in the near future. Our contributions are as follows: • We introduce O-CFI, the first low-overhead code-reuse defense that tolerates implementation disclosures. • We describe our implementation of a fully functional prototype that protects stripped, x86 legacy binaries without source code. II. THREAT MODEL Our work is motivated by the emergence of attacks against fine-grained diversity and coarse-grained control-flow integrity. We therefore introduce these attacks and distill them into a single, unified threat model. A. Bypassing Coarse-Grained CFI Ideally, CFI permits only programmer-intended control-flow transfers during a program's execution. The typical approach is to assign a unique ID to each permissible indirect controlflow target, and check the IDs at runtime. Unfortunately, this introduces performance overhead proportional to the degree of the graph-the more overlaps between valid target sets of indirect branch instructions, the more IDs must be stored and checked at each branch. Moreover, perfect CFI cannot be realized with a purely static control-flow graph; for example, the permissible destinations of function returns depend on the calling context, which is only known at runtime. Fine-grained CFI therefore implements a dynamically computed shadow stack, incurring high overheads To avoid this, coarse-grained CFI implementations resort to a reduced-degree, static approximation of the control-flow graph, and merge identifiers at the cost of reduced security. Typically, attackers need more than a single 4K page worth of code to find enough gadgets to mount a code-reuse attack. To discourage brute-force searches for more code pages, artificial diversity defenses routinely mine the address space with unmapped pages that abort the process if accessed B. Assumptions Given these sobering realities, we adopt a conservative threat model that assumes that attackers will eventually find and disassemble all code pages in victim processes. Our threat model therefore assumes that the adversary knows the complete in-memory code layout-including the locations of any gadgets required to launch a ROP attack. We also assume that the attacker can read and write the full contents of the heap and stack, as well as any data structures used by the dynamic loader. In keeping with common practice, we assume that data execution protection is activated, so that code page permissions can be maintained as either writable or executable but not both. However, we assume that attackers cannot safely perform a comprehensive, linear scan of virtual memory, since defenders may place unmapped guard pages at random locations. Instead, attackers must follow references from one disclosed memory page to another III. O-CFI OVERVIEW O-CFI combines insights from CFI and automated software diversity. It extends CFI with a new, coarse-grained CFI enforcement strategy inspired by bounds-checking, that validates control-flow transfers without divulging the bounds against which their destinations are checked. Bounds-checking is fast, the bounds are easier to conceal than arbitrary gadget locations, and the bounds are randomizable. This imbues CFI and fine-grained software diversity with an additional layer of protection against code-reuse attacks aided by implementation disclosures. As a result, O-CFI enjoys performance similar to coarse-grained CFI, with probabilistic security guarantees similar to fine-grained artificial diversity in the absence of implementation disclosures. Following traditional CFI, an O-CFI policy assigns to each indirect branch site a destination set that captures its set of permissible destination addresses. Such a graph can be derived from the program's source code or (with lesser precision) a conservative disassembly of its object code. We next reformulate this policy as a bounds-checking problem by reducing each destination set to only its minimal and maximal members. This policy approximation can be efficiently enforced by confining each branch to the memory-aligned addresses within its destination set range. All intended destination addresses are aligned within these bounds, so the enforcement conservatively preserves intended control-flows. Code layout is optimized to tighten the bounds, so that the set of unintended, aligned destinations within the bounds remains minimal. These few remaining unintended but reachable destinations are protected by the artificial diversity half of our approach. Our artificial diversity approach probabilistically protects the aligned, in-bounds, but policy-violating control-flows by applying fine-grained randomization to the binary code at load-time. While the overall strategy for implementing this randomization step is based on prior works Reformulating CFI in this way forces attackers to change their plan of attack. The recent attacks against coarse-grained CFI succeed by finding exploitable code that is reachable due to policy-relaxations needed for acceptable performance. These relaxations admit an alarming array of false-positives: instead of identifying the actual caller, all call-preceded instructions are incorrectly identified as permitted branch destinations. Such instructions saturate a typical address space, giving attackers too much wiggle room to build attacks. O-CFI counters this by changing the approximation approach: each branch destination is restricted to a relatively short span of aligned addresses, with all the bounds chosen pseudo-randomly at load-time. This greatly narrows the field of possible hijacks, and it removes the opportunity for attackers to analyze programs ahead of time for viable ROP gadget chains. In O-CFI, no two program instances admit the same set of ROP payloads, since the bounds are all randomized every time the program is loaded. Since the security of coarse-grained CFI depends in part on the precision of its policy approximation, it is worthwhile to improve the precision by tightening the bounds imposed upon each branch. This effectively reduces the space of attacker guesses that might succeed in hijacking any given branch. To reduce this space as much as possible, we introduce a novel binary code optimization, called portals, that minimizes the distance covered by the lowest and greatest element of each indirect branch's destination set. Our fine-grained artificial diversity implementation is an adaptation and extension of binary stirring To protect against information leaks that might disclose bounds information, our implementation is carefully designed to keep all bounds opaque to external threats. They are randomly chosen at load-time (as a side-effect of binary stirring) and stored in a bounds lookup table (BLT) located at a randomly chosen base address. The table size is very small relative to the virtual address space, and attackers cannot safely perform bruteforce scans of the full address space (see §II-B), so guessing the BLT's location is probabilistically infeasible for attackers. No code or data sections contain any pointer references to BLT addresses; all references are computed dynamically at load-time and stored henceforth exclusively in protected registers. A. Bounding the Control Flow For each indirect branch site with (non-empty) destination set D, O-CFI guards the branch instruction with a bounds-check that continues execution only if the impending target t satisfies t ∈ [min D, max D]. Indirect branch instructions include all control-flow transfer instructions that target computed destinations, including return instructions. Failure of the boundscheck solicits immediate process termination with an error code (for easier debugging). Termination could be replaced with a different intervention if desired, such as an automated attack analysis or alarm, followed by restart and re-randomization. The bounds-check implementation first loads the pair (min D, max D) from the BLT into registers via an indirect, indexed memory reference. The load instruction's arguments and syntax are independent of the BLT's location, concealing its address from attackers who can read the checking code. The impending branch target t is then checked against the loaded bounds. If the check succeeds, execution continues; otherwise the process immediately terminates with a bounds range (#BR) exception. The #BR exception helps distinguish between crashes and guessing attacks. To resist guessing attacks (e.g., BROP), web servers and other services should use this exception to trigger re-randomization as they restart. Following the approaches of PittSFIeld To bypass these checks, an attacker must craft a payload whose every gadget is properly aligned and falls within the bounds of the preceding gadget's conclusory indirect branch. The odds of guessing a reachable series of such gadgets decrease exponentially with the number of gadgets in the desired payload. B. Opacifying Control-flow Bounds Diversifying bounds. The bounds introduced by O-CFI constitute a coarse-grained CFI policy. Section II warns that such coarse granularity can lead to vulnerabilities. However, to exploit such vulnerabilities, attackers must discover which control-flows adhere to the CFI policy and which do not. To make the impermissible flows opaque to attackers, we use diversity. Our prototype uses a modified version of the technique outlined by Wartell et al. Performing fine-grain code randomization at load-time indirectly randomizes the ranges used to bound the control-flow. In contrast to other CFI techniques, attackers therefore do not have a priori knowledge of the control-flow bounds. Preventing Information Leaks. Attackers bypass fine grained diversity using information leaks, such as those described in §II-A. Were O-CFI's control-flow bounds expressed as constants in the instruction stream, attackers could bypass our defense via information leaks. To avoid this, we instead confine this sensitive information to an isolated data page, the BLT. The BLT is initialized at a random virtual memory address at load-time, and there are no pointer references (obfuscated or otherwise) to any BLT address in any code or data page in the process. This keeps its location hidden from attackers. Furthermore, we take additional steps to prevent accidental BLT disclosure via pointer leaks. Our prototype stores BLT base addresses in segment selectors-a legacy feature of all x86 processors. In particular, each load from the BLT uses the gs segment selector and a unique index to read the correct bounds. We only use the gs selector for instructions that implement bounds checks, so there are no other instructions that adversaries can reuse to learn its value. Attackers are also prevented from executing instructions that reveal the contents of the segment registers, since such instructions are privileged. To succeed, attackers must therefore (i) guess branch ranges, or (ii) guess the base address of the BLT. The odds of correctly guessing the location of the BLT are low enough to provide probabilistic protection. On 32-bit Windows Systems, for instance, the chances of guessing the base address are or less than one in two billion. Incorrect guesses alert defenders and trigger re-randomization with high probability (by accessing an unallocated memory page). The likelihood of successfully guessing a reachable gadget chain is a function of the length of the chain and the span of the bounds. The next section therefore focuses on reducing the average bounds span. C. Tightening Control-flow Check Bounds The distance between the lowest and highest intended destinations of any given indirect control-flow transfer instruction depends on the code layout. Placing indirect branches close to their targets both reduces bounds and improves locality, elevating both security and efficiency. Therefore we organize the code segment into clusters-one per indirect branch-each containing the basic blocks targeted by a particular branch. To accommodate blocks that are destinations of multiple distinct branch instructions, we consider three options: (i) put the block in one cluster and expand the bounds of other branches to include its address, (ii) create duplicate copies of the block in multiple clusters, or (iii) add a portal block to each cluster, which unconditionally jumps to the block. Each solution incurs a trade-off: expanding bounds reduces security, creating duplicates increases code size, and portals introduce runtime overhead. The options are not mutually exclusive, affording optimizers a range of strategies. Our experiments indicate that portals are often the best choice (see below). The capacity of the portal system limits the number of portals per nexus. Varying nexus capacity allows O-CFI to be tuned to different requirements. Setting it to zero prevents the creation of any portals, forcing the optimizer to choose alternative options. At the other extreme, setting no upper limit allows a portal to be created for every target, reducing all bounds ranges to wt, where w is the alignment width (usually 16 bytes; see §V-A) and t is the number of targets of the branch. At this setting, all indirect branches can only branch into a nexus, and through them, only to exactly those addresses that have been statically identified as targets. Thus, O-CFI with unbounded nexus capacity enforces fine-grained, static CFI. The extra layer of indirection imposed by a portal has a minor impact on runtime; there is thus a trade-off between security and performance. Users may opt for full CFI enforcement with O-CFI for security-critical components, and lower the nexus capacity to a desired performance level for less critical software. In our experiments, we found that a nexus capacity of 12 results in a significant reduction in bounds sizes with imperceptible performance effects. All of our experiments in §V use this nexus capacity. Section V-D details how different nexus capacities affect bound ranges. D. Example Defense against JIT-ROP The following example illustrates how O-CFI secures binaries against disclosure attacks. Consider a binary whose code segment contains five useful gadgets g 1 , . . . , g 5 . Each gadget terminates in an indirect branch protected by a bounds check. Under appropriate conditions, a disclosure attack such as JIT-ROP is able to recover a large portion of the runtime layout of the binary In our example, if g 1 is selected to be part of the payload, it can only be chained with gadget g 4 or g 5 . Attempting to jump from g 1 to any other gadget triggers a bounds violation that stops the attack. Similarly, an attack that hijacks a controlflow to c 1 can only redirect it to gadgets g 1 , g 2 , or g 3 ; all other gadgets are outside cluster c 1 and are therefore detected as impermissible destinations of the hijacked branch. Broadly speaking, all links in a payload's chain must traverse edges in the Cartesian product of the (aligned) gadget sets within the corresponding clusters. A successful attack must therefore limit itself to an extremely sparse graph of available edges. Our experiments (see §V-C) indicate that in practice the probability of successfully chaining gadgets in such a sparse graph is very low-just 0.01% for a four-gadget payload. The entropy of our procedure is further analyzed in §VI-A. IV. O-CFI IMPLEMENTATION We have implemented a fully functional prototype of O-CFI for the Intel x86 architecture. Our implementation uses a binary rewriting framework that secures COTS x86 binaries without source, debug, or relocation information. Like traditional CFI, however, we emphasize that O-CFI is equally suitable for inclusion in a compiler. Our rewriter generates a transformed version of the binary that leverages 1) a coarse-grained CFI policy that bounds control-flows, 2) fine-grained randomization to thwart traditional ROP attacks and diversify control-flow bounds so they become unknown and unreliable for attackers, 3) x86 segmentation registers to prevent accidental leakage of the bounds lookup table (BLT), and 4) an SFI framework similar to PittSFIeld [29] to enforce instruction alignment, denying attackers access to misaligned instructions that bypass bounds checks. The architecture of O-CFI is shown in A. Static Binary Rewriting 1) Conservative Disassembly: We first disassemble the code section using a conservative disassembler. Similar to the approach outlined by Wartell et al. [45], the code section is duplicated, with the old copy (renamed to .told) serving as a read-only data segment and the new copy (called .tnew) containing the rewritten executable code. The .told section is set non-executable, and all code blocks identified as possible targets of indirect jumps are overwritten with a five-byte tagged pointer. The tagged pointer consists of a tag byte (0xF4) followed by the four-byte address of that block in the .tnew section. The tag byte facilitates efficient runtime redirection of stale pointers to their correct targets,

CiteSeerX

Purely interpretative optimizations

Author: Brunthaler Stefan
Publication venue
Publication date: 01/01/2011
Field of study

Zsfassung in dt. SpracheInterpreters are easy to implement and can be made portable with only little extra eﬀort. Therefore, many popular programming languages choose an interpreter instead of a compiler as an execution platform. While the characteristics of interpreters are regarded as an upside, usually, their major downside is considered to be sub-optimal performance. Fortunately, in 1984 L. Peter Deutsch and Allan Schiﬀman introduced the modern concept of dynamic compilation sub-systems to the programming language implementation community, which subsequently became a success story, resulting in today's high performance just-in-time compilers for Java and .NET. Unfortunately, on the other hand, deciding to implement a dynamic compilation sub-system involves trading off the valuable innate characteristics of interpreters.This dissertation a) explains why for some interpreters known techniques do not yield reported speedups, b) provides orientation for focusing on other optimization targets, and c) presents several purely-interpretative optimization techniques that result in substantial speedups-we report speedups of up to 2.4176-while simultaneously preserving the ease of implementation and portability characteristics.Interpretierer sind einfach zu implementieren und können mit marginalem Zusatzaufwand portabel gemacht werden. Daher werden viele populäre Programmiersprachen interpretiert und nicht traditionell kompiliert. Im Allgemeinen sind diese Eigenschaften von Interpretierern vorteilhaft, jedoch verfügen Interpretierer über einen gravierenden Nachteil: sub-optimale Ausführungsgeschwindigkeit.Glücklicherweise wurde 1984 von L. Peter Deutsch und Allan Schiﬀman das Verfahren zur dynamischen Übersetzung eingeführt und in weiterer Folge popularisiert, sodass wir heute über hoch-performante dynamische Übersetzter, wie zum Beispiel die Java virtual machine oder die .NET Umgebung verfügen. Unglücklicherweise bedeutet die Entscheidung einen dynamischen Übersetzer zu implementieren gleichzeitig auch, die ursprünglichen Eigenschaften eines Interpretierers, mithin die Einfachheit der Implementierung und deren Portabilität, größtenteils aufzugeben.Die vorliegende Dissertation a) klärt auf warum für bestimmte Arten von Interpretierern bereits bekannte Techniken nicht ihr übliches Potential entfalten, b) gibt eine Orientierung welche Arten von Optimierungen höheres Potential bieten und c) präsentiert mehrere rein-interpretative Optimierungstechniken mit substantiellem Optimierungs-Potential - Geschwindigkeitssteigerungen bis zu einem Faktor von 2.4176 sind möglich - jedoch ohne die Eigenschaften von Interpretierern zu beeinträchtigen.13

reposiTUm

Accelerating Dynamically-Typed Languages on Heterogeneous Platforms Using Guards Optimization

Author: Brunthaler Stefan
Franz Michael
Na Yeoul
Qunaibit Mohaned
Volckaert Stijn
Publication venue: LIPIcs - Leibniz International Proceedings in Informatics. 32nd European Conference on Object-Oriented Programming (ECOOP 2018)
Publication date: 01/01/2018
Field of study

Scientific applications are ideal candidates for the "heterogeneous computing" paradigm, in which parts of a computation are "offloaded" to available accelerator hardware such as GPUs. However, when such applications are written in dynamic languages such as Python or R, as they increasingly are, things become less straightforward. The same flexibility that makes these languages so appealing to programmers also significantly complicates the problem of automatically and transparently partitioning a program\u27s execution between a CPU and available accelerator hardware without having to rely on programmer annotations. A common way of handling the features of dynamic languages is by introducing speculation in conjunction with guards to ascertain the validity of assumptions made in the speculative computation. Unfortunately, a single guard violation during the execution of "offloaded" code may result in a huge performance penalty and necessitate the complete re-execution of the offloaded computation. In the case of dynamic languages, this problem is compounded by the fact that a full compiler analysis is not always possible ahead of time. This paper presents MegaGuards, a new approach for speculatively executing dynamic languages on heterogeneous platforms in a fully automatic and transparent manner. Our method translates each target loop into a single static region devoid of any dynamic type features. The dynamic parts are instead handled by a construct that we call a mega guard which checks all the speculative assumptions ahead of its corresponding static region. Notably, the advantage of MegaGuards is not limited to heterogeneous computing; because it removes guards from compute-intensive loops, the approach also improves sequential performance. We have implemented MegaGuards along with an automatic loop parallelization backend in ZipPy, a Python Virtual Machine. The results of a careful and detailed evaluation reveal very significant speedups of an order of magnitude on average with a maximum speedup of up to two orders of magnitudes when compared to the original ZipPy performance as a baseline. These results demonstrate the potential for applying heterogeneous computing to dynamic languages

Dagstuhl Research Online Publication Server

Con sis te n t * Comple te * W ell Docu m e n te d * Easy to R e us e * Accelerating Iterators in Optimizing AST Interpreters

Author: Michael Franz
Per Larsen
Stefan Brunthaler
Wei Zhang
Publication venue
Publication date: 05/03/2020
Field of study

Abstract Generators offer an elegant way to express iterators. However, performance has always been their Achilles heel and has prevented widespread adoption. We present techniques to efficiently implement and optimize generators. We have implemented our optimizations in ZipPy, a modern, light-weight AST interpreter based Python 3 implementation targeting the Java virtual machine. Our implementation builds on a framework that optimizes AST interpreters using just-in-time compilation. In such a system, it is crucial that AST optimizations do not prevent subsequent optimizations. Our system was carefully designed to avoid this problem. We report an average speedup of 3.58× for generatorbound programs. As a result, using generators no longer has downsides and programmers are free to enjoy their upsides. Motivation Many programming languages support generators, which allow a natural expression of iterators. We surveyed the use of generators in real Python programs, and found that among the 50 most popular Python projects listed on the Python Package Index (PyPI) Generators provide programmers with special controlflow transfers that allows function executions to be suspended and resumed. Even though these control-flow transPermission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]. fers require extra computation, the biggest performance bottleneck is caused by preserving the state of a function between a suspend and a resume. This bottleneck is due to the use of cactus stacks required for state preservation. Popular language implementations, such as CPython [20], and CRuby [22], allocate frames on the heap. Heap allocation eliminates the need for cactus stacks, but is expensive on its own. Furthermore, function calls in those languages are known to be expensive as well. In this paper, we examine the challenges of improving generator performance for Python. First, we show how to efficiently implement generators in abstract syntax tree (AST) interpreters, which requires a fundamentally different design than existing implementations for bytecode interpreters. We use our own full-fledged prototype implementation of Python 3, called ZipPy 1 , which targets the Java virtual machine (JVM). ZipPy uses the Truffle framework [26] to optimize interpreted programs in stages, first collecting type feedback in the AST interpreter, then just-in-time compiling an AST down to optimized machine code. In particular, our implementation takes care not to prevent those subsequent optimizations. Our efficient generator implementation optimizes control-transfers via suspend and resume. Second, we describe an optimization for frequently used idiomatic patterns of generator usage in Python. Using this optimization allows our system to allocate generator frames to the native machine stack, eliminating the need for heap allocation. When combined, these two optimizations address both bottlenecks of using generators in popular programming languages, and finally give way to high performance generators. Summing up, our contributions are: • We present an efficient implementation of generators for AST based interpreters that is easy to implement and enables efficient optimization offered by just-in-time compilation. • We introduce generator peeling, a new optimization that eliminates overheads incurred by generators. • We provide results of a careful and detailed evaluation of our full-fledged prototype and report: 1 Publicly available at https://bitbucket.org/ssllab/zipp

CiteSeerX

Booby Trapping Software

Author: Michael Franz
Per Larsen
Stefan Brunthaler
Stephen Crane
Publication venue
Publication date
Field of study

Cyber warfare is asymmetric in the current paradigm, with attackers having the high ground over defenders. This asymmetry stems from the situation that attackers have the initiative, while defenders concentrate on passive fortifications. Defenders are constantly patching the newest hole in their defenses and creating taller and thicker walls, without placing guards on those walls to watch for the enemy and react to attacks. Current passive cyber security defenses such as intrusion detection, anti-virus, and hardened software are not sufficient to repel attackers. In fact, in conventional warfare this passivity would be entirely nonsensical, given the available active strategies, such as counterattacks and deception. Based on this observation, we have identified the technique of booby trapping software. This extends the arsenal of weaponry available to defenders with an active technique for directly reacting to attacks. Ultimately, we believe this approach will restore some of the much sought after equilibrium between attackers and defenders in the digital domain

CiteSeerX